Abstract

We report statistical properties of a network of two-Chinese-character compound words in the
Japanese language. In this network, a node represents a Chinese character and an edge
represents a two-Chinese-character compound word. The network is found to have both
"small-world" and "scale-free" properties. A network formed only by the Chinese characters
for common use (joyo-kanji in Japanese), which is a subnetwork of the original network, also
has the small-world property. However, the degree distribution of this subnetwork exhibits no
clear power law. To reproduce the disappearance of the power-law property, we propose a
model for the process of selecting the Chinese characters for common use.

1 Introduction 

It has been found that a great variety of systems, such as the Internet [1, 2], collaboration in
science [3, 4], and food webs [5, 6], have network structures: each system consists of a group of
nodes that interact mutually through edges. Network science supplies methods for understanding
the topological structures of such systems. Recently, it has been shown that the small-world
[7, 8] and scale-free [9] properties are important and that many networks share them. For
example, human languages have been modeled in the framework of complex networks in order
to investigate their graphemic [10], phonetic [11], syntactic [12], and semantic [13] structures.

Chinese characters are the main elements of the writing system of the Japanese language. One
of their most remarkable features is that they are ideograms; that is, a single Chinese
character can convey its own meaning.

The Japanese language possesses many words constructed by combining two Chinese characters.
Such words are called 'two-Chinese-character compound words' (niji-jukugo in Japanese),
and we call them 'two-character compounds' hereafter. For instance, in the Japanese-language
dictionary Kojien [14], about 90,000 of the roughly 200,000 headwords are two-character
compounds. So far, research on two-character compounds in Japanese has concentrated mostly
on morphological structures [15, 16] and cognitive processes [17, 18]. However, studies of
two-character compounds based on network science appear to be scarce. In the present paper,
we report the results of an analysis of networks of two-character compounds in Japanese.

2 Method 

First, we extracted networks of two-character compounds from the following Japanese-language
dictionaries: Kojien, Iwanami Kokugo Jiten, Sanseido Kokugo Jiten, and Mitsumura Kokugo
Gakushu Jiten [14]. Kojien, Iwanami, and Sanseido are standard dictionaries, whereas
Mitsumura is a dictionary for elementary and junior high school students. We picked out the
two-character compounds from the headwords of each dictionary.

In the network of two-character compounds, each Chinese character corresponds to a node,
and each two-character compound, which connects two nodes, is regarded as an edge.
Each edge has a direction from the upper character to the lower character. Thus, this network
is naturally viewed as a directed network with multiple edges and self loops. The direction of
the edges is closely related to the lexical structure and meaning of two-character compounds.
The multiplicity of edges represents the following two aspects: (i) some two-character
compounds have two or more readings, and (ii) some compounds become other existing compounds
when the upper and lower characters are inverted. A part of this network is depicted in Fig. 1.
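
As a rough illustration of the construction described above, the following Python sketch builds a directed multigraph from a list of two-character compounds and then collapses it to an undirected simple graph. The function names and the handful of sample words are assumptions made for this example only; they are not the data or code used in this study.

```python
from collections import Counter, defaultdict

def build_directed_multigraph(compounds):
    """Count directed edges (upper character -> lower character).

    `compounds` is an iterable of (word, reading) pairs; a word listed with
    several readings contributes several parallel edges.
    """
    edges = Counter()
    for word, _reading in compounds:
        if len(word) != 2:
            continue  # keep only two-character compounds
        edges[(word[0], word[1])] += 1
    return edges

def to_simple_undirected(edges):
    """Drop direction, multiplicity, and self loops; return adjacency sets."""
    adj = defaultdict(set)
    for (u, v) in edges:
        adj[u]  # touching the key registers the node even if it ends up isolated
        adj[v]
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    return dict(adj)

# Hypothetical sample entries (illustration only).
sample = [
    ("会社", "kaisha"), ("社会", "shakai"),  # inverted pair, aspect (ii)
    ("人気", "ninki"), ("人気", "hitoke"),   # one spelling, two readings, aspect (i)
    ("人人", "hitobito"),                    # self loop
]
directed = build_directed_multigraph(sample)
simple = to_simple_undirected(directed)
```

In this toy input, the two readings of one spelling produce a double edge, the inverted pair produces two oppositely directed edges, and all of them reduce to single undirected edges in `simple`.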

In the networks we obtained, not all nodes are connected: the whole network is made up
of 169 (Kojien), 152 (Iwanami), 142 (Sanseido), and 8 (Mitsumura) clusters (connected
components). In the following analysis, we consider the maximal cluster in the network of each
dictionary (more than 90% of the nodes belong to the maximal cluster). Since the essential
features of the networks can be described even without edge direction, multiplicity, and self
loops, we focus on the undirected and unweighted networks.
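
The extraction of the maximal cluster can be sketched as follows. This is a minimal example that assumes the network is stored as a dictionary of adjacency sets; the toy graph and the function names are hypothetical.

```python
from collections import deque

def connected_components(adj):
    """Return a list of node sets, one per connected component."""
    seen = set()
    components = []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = {start}
        while queue:
            node = queue.popleft()
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    component.add(nb)
                    queue.append(nb)
        components.append(component)
    return components

def largest_component(adj):
    """Return the adjacency dict induced on the largest component."""
    biggest = max(connected_components(adj), key=len)
    return {n: adj[n] & biggest for n in biggest}

toy = {
    "a": {"b"}, "b": {"a", "c"}, "c": {"b"},  # component of size 3
    "x": {"y"}, "y": {"x"},                   # component of size 2
    "z": set(),                               # isolated node
}
comps = connected_components(toy)
giant = largest_component(toy)
```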

3 Results 

Fundamental results obtained from each dictionary are summarized in Table 1. For instance,
in the case of Kojien, two nodes are about three steps apart on average and at most
ten steps apart (see $\ell$ and $D$ in Table 1). The clustering coefficient $C$ of each network
is about 20 times greater than $C_{\mathrm{rand}}$, that of a random network with the same
numbers of nodes and edges. Therefore, the networks of two-character compounds have short path
lengths and high clustering, as do many real networks [19]. The degree distributions of the
three standard dictionaries (shown in Fig. 2(a)-(c)) display a power law

$$p(k) \propto k^{-\gamma},$$

where $p(k)$ denotes the fraction of nodes having degree $k$. The values of $\gamma$ are nearly
1 for these three dictionaries, as shown in Table 1. However, as shown in Fig. 2(d), the degree
distribution of Mitsumura does not exhibit a clear power law.
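
The quantities discussed in this section can be computed along the following lines. This sketch assumes that $C$ is the average of local clustering coefficients (with nodes of degree below 2 contributing zero) and that the graph is connected; the text does not state the exact definitions used, so these choices, like the toy graph, are assumptions.

```python
from collections import Counter, deque
from itertools import combinations

def average_degree(adj):
    return sum(len(nbs) for nbs in adj.values()) / len(adj)

def degree_distribution(adj):
    """Fraction p(k) of nodes having degree k."""
    counts = Counter(len(nbs) for nbs in adj.values())
    return {k: c / len(adj) for k, c in counts.items()}

def bfs_distances(adj, source):
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist

def path_statistics(adj):
    """Mean shortest-path length and diameter of a connected graph."""
    total, pairs, diameter = 0, 0, 0
    for source in adj:
        for target, d in bfs_distances(adj, source).items():
            if target != source:
                total += d
                pairs += 1
                diameter = max(diameter, d)
    return total / pairs, diameter

def clustering_coefficient(adj):
    """Average local clustering; nodes of degree < 2 contribute 0 (assumed)."""
    total = 0.0
    for nbs in adj.values():
        k = len(nbs)
        if k < 2:
            continue
        links = sum(1 for u, v in combinations(nbs, 2) if v in adj[u])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)

# Toy graph: a triangle (a, b, c) with a pendant node d attached to c.
toy = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
k_mean = average_degree(toy)
p_k = degree_distribution(toy)
ell, diam = path_statistics(toy)
c_mean = clustering_coefficient(toy)
```

The all-pairs breadth-first search is quadratic in the number of nodes, which is affordable for networks of a few thousand nodes such as those in Table 1.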



4 Restricted network formed by Chinese characters for common use

In this section, we discuss why the degree distribution of Mitsumura does not exhibit a
power law (see Fig. 2(d) for reference). There are 1,945 Chinese characters designated for
common use, which are called joyo-kanji in Japanese, selected by the Ministry of Education,
Science and Culture of Japan in 1981. We call them 'common-use characters' hereafter. The
common-use characters are taught during elementary and junior high school in Japan, and
most Chinese characters used in Japan are common-use characters. Moreover, Chinese
characters other than the common-use characters are not permitted in legal documents. We
next consider a network constructed only from the common-use characters. Note that this
network is a subnetwork of the original network.
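
The restriction to common-use characters amounts to taking an induced subgraph. A minimal sketch, assuming adjacency sets as the data structure and using a hypothetical stand-in for the character list:

```python
def restrict_to(adj, allowed):
    """Induced subgraph: keep allowed nodes and edges with both ends allowed."""
    return {n: nbs & allowed for n, nbs in adj.items() if n in allowed}

# Toy data; `common_use` is a hypothetical stand-in, not the real joyo-kanji list.
full = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
common_use = {"a", "b", "d"}
sub = restrict_to(full, common_use)
```

Here node d loses its only neighbor and becomes isolated, which shows that a restricted network is not connected automatically.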

Fundamental results for the network restricted to the common-use characters are summarized
in Table 2. For the first three dictionaries in Table 2, the mean path lengths are smaller and
the clustering coefficients are larger than those presented in Table 1. On the other hand, the
properties of the network of Mitsumura in Table 2 are the same as those in Table 1. This
reflects the fact that the two-character compounds listed in Mitsumura are all constructed from
the common-use characters (recall that this dictionary is for students of elementary and junior
high school). As shown in Fig. 3, the degree distributions of the networks of the common-use
characters do not show power-law behavior in any of the four dictionaries. These degree
distributions share the feature that there is a plateau in the range of small $k$ ($k < 10$)
and a decay for large $k$ ($k > 10$).



5 Invasion model for selecting the common-use characters

The property of the degree distributions of the restricted networks shown above is considered
to be caused by the selection process of the common-use characters. For this process, we
propose a stochastic model on the 'real' maximal network of each dictionary. First, we assume
that each node in the network has two states, invaded or uninvaded, and that all nodes are
initially uninvaded. Then, one node is chosen randomly from the network and is turned into an
invaded node. At each time step, one node $v_i$ is chosen with probability $P_i$ from all
uninvaded nodes $\{v_1, v_2, \ldots, v_n\}$ connected to invaded nodes. The probability $P_i$ is
assumed to be given by

$$P_i = \frac{k_i^{\alpha}}{\sum_{j=1}^{n} k_j^{\alpha}} \qquad (i = 1, \ldots, n), \qquad (1)$$
where $k_j$ represents the degree of node $v_j$ and $\alpha$ is a constant, which is determined
below. The case $\alpha = 0$ corresponds to random growth, that is, all $v_i$ have equal
probability of invasion, and the case $\alpha > 0$ corresponds to 'preferential' growth, that
is, nodes of larger degree are invaded more easily [9, 20]. The invasion process is shown
schematically in Fig. 4. In this model, invaded nodes are regarded as the common-use characters.
The value of $\alpha$ should be positive, since the common-use characters tend to have large
degrees. Thus, this model is a preferential growth process of an invaded cluster on a network,
and it is similar to invasion percolation [21] or the Eden model [22].
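
The invasion process can be sketched in Python as follows. This is an illustrative implementation, not the code used for the numerical results: it assumes that the degree in Eq. (1) is the degree in the whole network, and all function and variable names are hypothetical.

```python
import random

def invasion_probabilities(adj, invaded, alpha):
    """Return P_i of Eq. (1) for every uninvaded node adjacent to `invaded`."""
    frontier = {nb for v in invaded for nb in adj[v]} - invaded
    weights = {v: len(adj[v]) ** alpha for v in frontier}
    norm = sum(weights.values())
    return {v: w / norm for v, w in weights.items()}

def invade(adj, n_target, alpha, rng):
    """Grow an invaded cluster on `adj` until it holds `n_target` nodes.

    Degrees are taken in the whole network (an assumption of this sketch).
    """
    seed = rng.choice(sorted(adj))          # random initial invaded node
    invaded = {seed}
    frontier = set(adj[seed])               # uninvaded neighbours of the cluster
    while len(invaded) < n_target and frontier:
        candidates = sorted(frontier)       # fixed order for reproducibility
        weights = [len(adj[v]) ** alpha for v in candidates]
        chosen = rng.choices(candidates, weights=weights, k=1)[0]
        invaded.add(chosen)
        frontier.discard(chosen)
        frontier.update(nb for nb in adj[chosen] if nb not in invaded)
    return invaded

# Toy graph: a triangle (a, b, c) with a pendant node d attached to c.
toy = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
probs_pref = invasion_probabilities(toy, {"c"}, alpha=1.0)  # preferential growth
probs_rand = invasion_probabilities(toy, {"c"}, alpha=0.0)  # random growth
cluster = invade(toy, n_target=3, alpha=1.0, rng=random.Random(0))
```

With node c invaded and $\alpha = 1$, the candidates a, b, and d have degrees 2, 2, and 1, so their invasion probabilities are 2/5, 2/5, and 1/5; with $\alpha = 0$ each is 1/3. Because new nodes are always drawn from the neighbors of the cluster, the invaded set stays connected by construction.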

The invasion process was performed numerically until the number of invaded nodes reached
the size of the network of the common-use characters in each dictionary except Mitsumura.
Then, we calculated $\langle k \rangle$ for the subnetwork of invaded nodes. To determine
$\alpha$, we require that the average degree $\langle k \rangle$ of the subnetwork of invaded
nodes be almost the same as that of the real network of common-use characters.
$\langle k \rangle$ as a function of $\alpha$ is depicted in Fig. 5 in the range
$0 < \alpha < 2$. This figure suggests that $\alpha \approx 1.3$ is appropriate for the three
dictionaries.

The numerical results are in good agreement with the real networks, as shown in Table 3.
Fig. 6 shows that the degree distributions obtained from the numerical results are also in good
agreement with those obtained from the real networks.
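
One way to carry out this determination of $\alpha$ is a simple grid scan, sketched below. The grid, the sample count, the target value, and the random test graph are placeholders chosen for illustration (the test graph stands in for a dictionary network, and the maximal cluster is not extracted here), and the invasion routine repeats the assumption that degrees are taken in the whole network.

```python
import random

def invade(adj, n_target, alpha, rng):
    """Invasion process of Eq. (1); returns the set of invaded nodes."""
    seed = rng.choice(sorted(adj))
    invaded, frontier = {seed}, set(adj[seed])
    while len(invaded) < n_target and frontier:
        cand = sorted(frontier)
        w = [len(adj[v]) ** alpha for v in cand]
        v = rng.choices(cand, weights=w, k=1)[0]
        invaded.add(v)
        frontier.discard(v)
        frontier.update(nb for nb in adj[v] if nb not in invaded)
    return invaded

def subnetwork_mean_degree(adj, nodes):
    """Average degree of the subgraph induced on `nodes`."""
    return sum(len(adj[v] & nodes) for v in nodes) / len(nodes)

def scan_alpha(adj, n_target, k_target, alphas, samples, rng):
    """Return (best_alpha, {alpha: sample-averaged <k>}) over a grid."""
    curve = {}
    for alpha in alphas:
        runs = [subnetwork_mean_degree(adj, invade(adj, n_target, alpha, rng))
                for _ in range(samples)]
        curve[alpha] = sum(runs) / samples
    best = min(curve, key=lambda a: abs(curve[a] - k_target))
    return best, curve

def random_graph(n, p, rng):
    """Small random test graph (a stand-in, not dictionary data)."""
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                adj[i].add(j)
                adj[j].add(i)
    return adj

rng = random.Random(0)
graph = random_graph(60, 0.1, rng)
alphas = [0.0, 0.5, 1.0, 1.5, 2.0]
best, curve = scan_alpha(graph, n_target=20, k_target=3.0,
                         alphas=alphas, samples=20, rng=rng)
```

A finer grid around the best value, or interpolation of the curve near the target, would narrow the estimate in the same spirit as reading off the intersection in Fig. 5.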

6 Discussion 

Our analysis has shown that the network of two-character compounds has both small-world
and scale-free properties. The emergence of the scale-free property may be associated with
a fitness model [23]. In the fitness model, each node $v_i$ in a network has a fitness $x_i$,
which is drawn independently and randomly from a given distribution function $\rho(x)$
(fitness generally represents some kind of "importance" or "sociability" of nodes). An edge
between $v_i$ and $v_j$ is drawn with a probability $f(x_i, x_j)$ that depends on the fitness
of the nodes involved. It is known that the fitness model can produce a power-law degree
distribution.
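
The generating rule of the fitness model can be sketched as below. The fitness distribution and the linking function used in the example call are arbitrary placeholders chosen only to make the code run; they are not taken from [23], and this particular choice is not claimed to yield a power law.

```python
import random

def fitness_network(n, draw_fitness, link_prob, rng):
    """Generate an undirected graph from node fitnesses.

    Each node i gets x_i = draw_fitness(rng); each pair (i, j) is linked
    with probability link_prob(x_i, x_j).
    """
    x = [draw_fitness(rng) for _ in range(n)]
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < link_prob(x[i], x[j]):
                adj[i].add(j)
                adj[j].add(i)
    return x, adj

rng = random.Random(1)
x, adj = fitness_network(
    n=200,
    draw_fitness=lambda r: r.random(),  # fitness distribution: uniform on [0, 1), arbitrary
    link_prob=lambda a, b: a * b,       # f(x_i, x_j): arbitrary example
    rng=rng,
)
```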

For the network of two-character compounds, the frequency of use differs among Chinese
characters: some characters are used quite frequently, and some are used only in particular
cases. It is natural to expect that two-character compounds are created more frequently between
Chinese characters that are used more widely. Hence, an effect related to the fitness model may
be what gives the network of two-character compounds its scale-free property (the fitness in
this case relates to frequency of use).

We have also found that the network of the common-use characters is connected, and that its
average degree is larger than that of the whole network. The invasion model proposed above is
a simple way to ensure the connectivity and large degree of the resultant network. The model
involves one parameter $\alpha$, and the growth process of the invaded cluster depends on the
value of $\alpha$. For positive $\alpha$, nodes of larger degree are assigned a larger invasion
probability according to Eq. (1). Hence most nodes of small degree do not join a network
generated by the model, and the power-law behavior of the degree distribution vanishes.
Moreover, for an appropriate value of $\alpha$, a plateau emerges in the range of small $k$ in
the degree distribution. We found that the value of $\alpha$ is nearly 1.3 for all three
dictionaries, but we have not yet found a clear explanation for this universality.

We have confirmed that the network characteristics (Table 1) and degree distributions (Figs.
2 and 3) are essentially the same when edge direction, multiplicity, and self loops are
taken into account. We think that further analysis including direction and multiplicity will
reveal more precise structures of the network of two-character compounds. However, such an
analysis may be rather linguistic or lexical. In fact, the direction and multiplicity of edges
are closely related to the individual meanings of the characters and to the formation principle
of Japanese two-character compounds, which is classified into nine types from a grammatical
point of view [21].

7 Conclusion 

The network constructed from the two-character compounds in the Japanese language has a short
path length and high clustering (Table 1). The network also has a power-law degree distribution
(Fig. 2), but the subnetwork restricted to the common-use characters does not show a power-law
distribution (Fig. 3). The generation of the network of the common-use characters can be modeled
by an invasion process in which the invasion probability of a node with degree $k$ is
proportional to $k^{\alpha}$ (see Eq. (1)). The exponent $\alpha$ is determined by consistency
between the real and numerical values of $\langle k \rangle$ (Fig. 5). We confirmed that the
results obtained from the model agree well with the real networks (Table 3).
FIGURES & TABLES

Fig. 1 A part of the network extracted from Kojien: (a) original network, (b) network omitting 
direction, multiple edges, and self loops 

Fig. 2 Degree distribution of the network of each dictionary: (a) Kojien, (b) Iwanami, (c) 
Sanseido, and (d) Mitsumura. In (a)-(c), the solid lines show guidelines of power-law 
behaviors. 

Fig. 3 Degree distributions of networks of the common-use characters: (a) Kojien, (b) Iwanami, 
(c) Sanseido and (d) Mitsumura. 

Fig. 4 An illustration of an invasion process on a network. The number on each node indicates
the degree of the node, and the fraction beside each node indicates its invasion
probability ($\alpha = 1$) at each time step. The dashed and solid lines indicate edges of the
whole network and of the generated subnetwork, respectively. Black nodes represent invaded
nodes.

Fig. 5 Numerical results used to determine the value of $\alpha$. The solid lines represent the
average degree $\langle k \rangle$ obtained from the model as a function of $\alpha$ (averaged
over 50 samples), and the dashed lines represent $\langle k \rangle$ shown in Table 2. The
intersection of the solid and dashed lines indicates $\alpha \approx$ (a) 1.29 for Kojien,
(b) 1.35 for Iwanami, and (c) 1.33 for Sanseido.

Fig. 6 Degree distributions of real common-use characters corresponding to (a1) Kojien, (b1)
Iwanami, and (c1) Sanseido, and ones obtained from numerical results corresponding to
(a2) Kojien, (b2) Iwanami, and (c2) Sanseido. (a1), (b1), and (c1) are identical to Fig.
3(a)-(c).



Table 1 The characteristics of the maximal cluster in a network of two-character compounds.
$\langle k \rangle$, $\ell$, $D$, and $C$ denote average degree, mean path length, diameter,
and clustering coefficient, respectively. $C_{\mathrm{rand}}$ represents the averaged
clustering coefficient of 50 random networks with the same numbers of nodes and edges.

Table 2 The characteristics of the network of common-use characters. 

Table 3 Comparison between real networks of common-use characters and numerical results
($\alpha = 1.3$). Numerical results are obtained by averaging 50 samples.
Figure 1: K. Yamamoto and Y. Yamazaki
Table 1: K. Yamamoto and Y. Yamazaki

Dictionary   Nodes   Edges   $\langle k \rangle$   $\ell$   $D$   $C$     $C_{\mathrm{rand}}$   $\gamma$
Kojien       5458    74617   27.3                  3.14     10    0.138   0.00501               1.04
Iwanami      3904    32150   16.5                  3.31     10    0.085   0.00424               1.04
Sanseido     3444    28358   16.5                  3.32     9     0.086   0.00483               1.05
Mitsumura    1799    9054    10.1                  3.42     8     0.059   0.00255
Table 2: K. Yamamoto and Y. Yamazaki

Dictionary   Nodes   Edges   $\langle k \rangle$   $\ell$   $D$   $C$
Kojien       1940    54181   55.9                  2.32     5     0.172
Iwanami      1933    26419   27.3                  2.67     6     0.111
Sanseido     1921    24726   25.7                  2.73     7     0.114
Mitsumura    1799    9054    10.1                  3.42     8     0.059
Table 3: K. Yamamoto and Y. Yamazaki

             Real networks                          Numerical results
Dictionary   $\langle k \rangle$   $\ell$   $C$     $\langle k \rangle$   $\ell$        $C$
Kojien       55.9                  2.32     0.172   56.0 ±0.8             2.32 ±0.01    0.175 ±0.004
Iwanami      27.3                  2.67     0.111   27.2 ±0.2             2.68 ±0.01    0.109 ±0.005
Sanseido     25.7                  2.73     0.114   25.7 ±0.2             2.73 ±0.02    0.109 ±0.004